Abstract. We prove that, for any arbitrary finite alphabet and for the uniform distribution over deterministic and accessible automata with $n$ states, the average complexity of Moore's state minimization algorithm is in $O(n \log n)$. Moreover this bound is tight in the case of unary automata.



1. Introduction 

Deterministic automata are a convenient way to represent regular languages that can be 
used to efficiently perform most of the usual computations involving regular languages. Therefore
finite state automata appear in many fields of computer science, such as linguistics,
data compression, bioinformatics, etc. To a given regular language one can associate a 
unique smallest deterministic automaton, called its minimal automaton. This canonical 
representation of regular languages is compact and provides an easy way to check equality. 
As a consequence, state minimization algorithms that compute the minimal automaton of 
a regular language, given by a deterministic automaton, are fundamental. 

Moore proposed a solution [15] that can be seen as a sequence of partition refinements. Starting from a partition of the set of states, of size $n$, into two parts, successive refinements lead to a partition whose elements are the sets of indistinguishable states, which can be merged to form a smaller automaton recognizing the same language. As there are at most $n - 2$ such refinements, each of them requiring a linear running time, the worst-case complexity of Moore's state minimization algorithm is quadratic. Hopcroft's state minimization algorithm [11] also uses partition refinements to compute the minimal automaton, carefully selecting the parts that are split at each step. Using suitable data structures, its worst-case complexity is in $O(n \log n)$. It is the best known minimization algorithm, and therefore it has been intensively studied, see [1, 5, 9, 12] for instance. Finally, Brzozowski's algorithm [6, 7] is different from the other ones. Its inputs may be non-deterministic automata. It is based on two successive determinization steps, and though its worst-case complexity is proved to be exponential, it has been noticed that it is often sub-exponential in practice. The reader is invited to consult [17], which presents a taxonomy of minimization algorithms, for a more exhaustive list.

Figure 1: Time complexity of Moore's and Hopcroft's algorithms.

Figure 2: Number of partition refinements in Moore's algorithm.

The results of Fig. 1 and Fig. 2 were obtained with the C++ library REGAL (available at: http://regal.univ-mlv.fr/) to randomly generate deterministic accessible automata [2, 3]. Each value is computed from 20 000 automata over a 2-letter alphabet.

In this paper we study the average time complexity of Moore's algorithm. From an experimental point of view, the average complexity of Moore's algorithm seems to be smaller than the complexity of Hopcroft's algorithm (Fig. 1) and the number of partition refinements increases very slowly as the size of the input grows (Fig. 2). In the following we mainly prove that on average, for the uniform distribution, Moore's algorithm performs only $O(\log n)$ refinements, thus its average complexity is in $O(n \log n)$.

After briefly recalling the basics of minimization of automata in Section 2, we prove in Section 3 that the average time complexity of Moore's algorithm is $O(n \log n)$ and show in Section 4 that this bound is tight when the alphabet is unary. The paper closes with a short discussion about generalizations of our main theorem to Bernoulli distributions and to incomplete automata in Section 5, and the presentation of a conjecture based on the slow growth of the number of refinements (Fig. 2) when the alphabet is not unary in Section 6.



2. Preliminaries 

This section is devoted to basic notions related to the minimization of automata. We 
refer the reader to the literature for more details about minimization of automata [10, 14, 18].
We only record a few definitions and results that will be useful for our purpose. 

2.1. Finite automata 

A finite deterministic automaton $\mathcal{A}$ is a quintuple $\mathcal{A} = (A, Q, \cdot, q_0, F)$ where $Q$ is a finite set of states, $A$ is a finite set of letters called the alphabet, the transition function $\cdot$ is a mapping from $Q \times A$ to $Q$, $q_0 \in Q$ is the initial state and $F \subseteq Q$ is the set of final states. An automaton is complete when its transition function is total. The transition function can be extended by morphism to all words of $A^*$: $p \cdot \varepsilon = p$ for any $p \in Q$ and, for any $u, v \in A^*$, $p \cdot (uv) = (p \cdot u) \cdot v$. A word $u \in A^*$ is recognized by an automaton when $q_0 \cdot u \in F$. The set of all words recognized by $\mathcal{A}$ is denoted by $L(\mathcal{A})$. An automaton is accessible when for any state $p \in Q$, there exists a word $u \in A^*$ such that $q_0 \cdot u = p$.
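
The definitions above can be made concrete with a small sketch. It assumes a representation that is not part of the paper: states are numbered from 0, letters are the integers 0, ..., k-1, and the transition function is a list of lists `delta` with `delta[p][a]` standing for $p \cdot a$. The function names `run`, `recognizes` and `is_accessible` are hypothetical.

```python
from collections import deque


def run(delta, p, word):
    """Extended transition function: the state p . word."""
    for a in word:
        p = delta[p][a]
    return p


def recognizes(delta, q0, finals, word):
    """A word is recognized when q0 . word is a final state."""
    return run(delta, q0, word) in finals


def is_accessible(delta, q0):
    """Breadth-first search from q0: every state must be reached."""
    seen = {q0}
    queue = deque([q0])
    while queue:
        p = queue.popleft()
        for q in delta[p]:
            if q not in seen:
                seen.add(q)
                queue.append(q)
    return len(seen) == len(delta)


# Illustrative automaton over {a, b} (a = 0, b = 1) for the words ending with "ab".
ENDS_WITH_AB = [[1, 0], [1, 2], [1, 0]]
```

With this encoding, a complete automaton is simply one where every row of `delta` has an entry for each letter.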

A transition structure is an automaton where the set of final states is not specified. Given such a transition structure $\mathcal{T} = (A, Q, \cdot, q_0)$ and a subset $F$ of $Q$, we denote by $(\mathcal{T}, F)$ the automaton $(A, Q, \cdot, q_0, F)$. For a given deterministic and accessible transition structure with $n$ states there are exactly $2^n$ distinct deterministic and accessible automata that can be built from this transition structure. Each of them corresponds to a choice of set of final states.

In the following we only consider complete accessible deterministic automata and complete accessible deterministic transition structures, except in the presentation of the generalizations of the main theorem in Section 5. Consequently these objects will often just be called respectively automata or transition structures. The set $Q$ of states of an $n$-state transition structure will be denoted by $\{1, \dots, n\}$.

2.2. Myhill-Nerode equivalence 

Let $\mathcal{A} = (A, Q, \cdot, q_0, F)$ be an automaton. For any nonnegative integer $i$, two states $p, q \in Q$ are $i$-equivalent, denoted by $p \sim_i q$, when for all words $u$ of length less than or equal to $i$, $[\![p \cdot u \in F]\!] = [\![q \cdot u \in F]\!]$, where the Iverson bracket $[\![Cond]\!]$ is equal to 1 if the condition $Cond$ is satisfied and 0 otherwise. Two states are equivalent when for all $u \in A^*$, $[\![p \cdot u \in F]\!] = [\![q \cdot u \in F]\!]$. This equivalence relation is called Myhill-Nerode equivalence. An equivalence relation $\equiv$ defined on the set of states $Q$ of a deterministic automaton is said to be right invariant when

for all $u \in A^*$ and all $p, q \in Q$, $p \equiv q \Rightarrow p \cdot u \equiv q \cdot u$.

The following proposition [10, 14, 18] summarizes the properties of Myhill-Nerode equivalence that will be used in the next sections.

Proposition 2.1. Let $\mathcal{A} = (A, Q, \cdot, q_0, F)$ be a deterministic automaton with $n$ states. The following properties hold:

(1) For all $i \in \mathbb{N}$, $\sim_{i+1}$ is a partition refinement of $\sim_i$, that is, for all $p, q \in Q$, if $p \sim_{i+1} q$ then $p \sim_i q$.

(2) For all $i \in \mathbb{N}$ and for all $p, q \in Q$, $p \sim_{i+1} q$ if and only if $p \sim_i q$ and for all $a \in A$, $p \cdot a \sim_i q \cdot a$.

(3) If for some $i \in \mathbb{N}$ $(i+1)$-equivalence is equal to $i$-equivalence then for every $j \geq i$, $j$-equivalence is equal to Myhill-Nerode equivalence.

(4) $(n-2)$-equivalence is equal to Myhill-Nerode equivalence.

(5) Myhill-Nerode equivalence is right invariant. 
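
As an illustration of properties (1), (3) and (4), the sketch below computes $i$-equivalence directly from its definition by enumerating every word of length at most $i$. This is exponential in $i$ and only meant for tiny examples; it is not how Moore's algorithm proceeds. The encoding (0-indexed states, integer letters, `delta[p][a]` for $p \cdot a$) and the names `i_equivalence` and `refines` are assumptions made for the example.

```python
from itertools import product


def i_equivalence(delta, finals, i):
    """Label each state by its class for ~_i, straight from the definition."""
    n, k = len(delta), len(delta[0])

    def profile(p):
        # Finality of p . u for every word u of length <= i, in a fixed order.
        bits = []
        for length in range(i + 1):
            for word in product(range(k), repeat=length):
                q = p
                for a in word:
                    q = delta[q][a]
                bits.append(q in finals)
        return tuple(bits)

    labels = {}
    return [labels.setdefault(profile(p), len(labels)) for p in range(n)]


def refines(fine, coarse):
    """True when every class of `fine` is contained in a class of `coarse`."""
    n = len(fine)
    return all(coarse[p] == coarse[q]
               for p in range(n) for q in range(n) if fine[p] == fine[q])


# Unary cycle of length 6 whose final states are 0 and 3 (counting modulo 3).
CYCLE = [[(p + 1) % 6] for p in range(6)]
```

On this cycle the partition stabilizes at $i = 1$ with three classes, well before the general bound $n - 2 = 4$ of property (4).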

Let $\mathcal{A} = (A, Q, \cdot, q_0, F)$ be an automaton and $\equiv$ be a right invariant equivalence relation on $Q$. The quotient automaton of $\mathcal{A}$ by $\equiv$ is the automaton

$$\mathcal{A}/\equiv \; = (A, Q/\equiv, *, [q_0], \{[f], f \in F\}),$$

where $Q/\equiv$ is the set of equivalence classes, $[q]$ is the class of $q \in Q$, and $*$ is defined for any $a \in A$ and any $q \in Q$ by $[q] * a = [q \cdot a]$. The correctness of this definition relies on the right invariance of the equivalence relation $\equiv$.
Theorem 2.2. For any complete, accessible and deterministic automaton $\mathcal{A}$, the automaton $\mathcal{A}/\sim$ is the unique smallest automaton (in terms of the number of states) that recognizes the same language as the automaton $\mathcal{A}$. It is called the minimal automaton of $L(\mathcal{A})$.

The uniqueness of the minimal automaton is up to labelling of the states. Theorem 2.2 shows that the minimal automaton is a fundamental notion in language theory: it is the most space efficient representation of a regular language by a deterministic automaton, and the uniqueness defines a bijection between regular languages and minimal automata. For instance, to check whether two regular languages are equal, one can compare their minimal automata. It is one of the motivations for the algorithmic study of the computation of the minimal automaton of a language.

2.3. Moore's state minimization algorithm 

In this section we describe the algorithm due to Moore [15] which computes the minimal automaton of a regular language represented by a deterministic automaton. Recall that Moore's algorithm builds the partition of the set of states corresponding to Myhill-Nerode equivalence. It mainly relies on properties (2) and (3) of Proposition 2.1: the partition $\pi$ is initialized according to the 0-equivalence $\sim_0$, then at each iteration the partition corresponding to the $(i+1)$-equivalence $\sim_{i+1}$ is computed from the one corresponding to the $i$-equivalence $\sim_i$ using property (2). The algorithm halts when no new partition refinement is obtained, and the result is Myhill-Nerode equivalence according to property (3). The minimal automaton can then be computed from the resulting partition since it is the quotient automaton by Myhill-Nerode equivalence.

Algorithm 1: Moore

 1  if F = ∅ then
 2      return (A, {1}, *, 1, ∅)
 3  end
 4  if F = {1, ..., n} then
 5      return (A, {1}, *, 1, {1})
 6  end
 7  forall p ∈ {1, ..., n} do
 8      π'[p] = [[p ∈ F]]
 9  end
10  π = undefined
11  while π ≠ π' do
12      π = π'
13      compute the partition π' s.t. π'[p] = π'[q] iff π[p] = π[q]
14          and ∀a ∈ A, π[p · a] = π[q · a]
15  end
16  return the quotient of A by π

In the description of Moore's algorithm, $*$ denotes the function such that $1 * a = 1$ for all $a \in A$. Lines 1-6 correspond to the special cases where $F = \emptyset$ or $F = Q$. In the process, $\pi'$ is the new partition and $\pi$ the former one. Lines 7-9 consist of the initialization of $\pi'$ to the partition of $\sim_0$; $\pi$ is initially undefined. Lines 11-14 are the main loop of the algorithm, where $\pi$ is set to $\pi'$ and the new $\pi'$ is computed. Line 13 is described more precisely in Algorithm 2: with each state $p$ is associated a $(k+1)$-tuple $s[p]$ such that two states should be in the same part in $\pi'$ when they have the same $(k+1)$-tuple. The matches are found by sorting the states according to their associated tuple.

Algorithm 2: Computing π' from π

 1  forall p ∈ {1, ..., n} do
 2      s[p] = (π[p], π[p · a_1], ..., π[p · a_k])
 3  end
 4  compute the permutation σ that sorts the states according to s[]
 5  i = 0
 6  π'[σ(1)] = i
 7  forall p ∈ {2, ..., n} do
 8      if s[σ(p)] ≠ s[σ(p - 1)] then i = i + 1
 9      π'[σ(p)] = i
10  end
11  return π'
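
The two algorithms can be sketched in a few lines of Python. This is an illustrative implementation under assumptions that are not in the paper: states are 0-indexed, `delta[p][a]` stands for $p \cdot a$, the function name `moore` is hypothetical, and the signatures $s[p]$ are grouped with a comparison sort of the distinct tuples rather than the linear-time lexicographic sort assumed in the complexity analysis. The loop stops when the number of classes no longer grows, which is equivalent to $\pi = \pi'$ because each new partition refines the previous one.

```python
def moore(delta, finals):
    """Return (partition, iterations) for the automaton (delta, finals).

    partition[p] is the class label of state p for Myhill-Nerode equivalence;
    iterations counts the passes through the main loop of Algorithm 1.
    """
    n, k = len(delta), len(delta[0])
    # Special cases F = emptyset or F = Q: a single class, no iteration.
    if len(finals) == 0 or len(finals) == n:
        return [0] * n, 0
    part = [1 if p in finals else 0 for p in range(n)]  # partition of ~_0
    iterations = 0
    while True:
        iterations += 1
        # Algorithm 2: signature s[p] = (pi[p], pi[p.a_1], ..., pi[p.a_k]).
        sig = [(part[p],) + tuple(part[delta[p][a]] for a in range(k))
               for p in range(n)]
        ids = {s: i for i, s in enumerate(sorted(set(sig)))}
        if len(ids) == len(set(part)):
            # No class was split: part is already Myhill-Nerode equivalence.
            return part, iterations
        part = [ids[s] for s in sig]


# Unary cycle of length 6 with final states 0 and 3: three classes remain.
partition, rounds = moore([[(p + 1) % 6] for p in range(6)], {0, 3})
```

The quotient automaton is then obtained by keeping one representative per class and redirecting each transition to the class of its target.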

The worst-case time complexity of Moore's algorithm is in $O(n^2)$. The following result is a more precise statement about the worst-case complexity of this algorithm that will be used in the proof of the main theorem (Theorem 3.1). For the sake of completeness we also give a justification of this statement.

For any integer $n > 1$ and any $m \in \{0, \dots, n-2\}$, we denote by $\mathcal{A}_n^{(m)}$ the set of automata with $n$ states for which $m$ is the smallest integer such that the $m$-equivalence $\sim_m$ is equal to Myhill-Nerode equivalence. We also denote by $\mathrm{Moore}(\mathcal{A})$ the number of iterations of the main loop when Moore's algorithm is applied to the automaton $\mathcal{A}$.

Lemma 2.3. For any automaton $\mathcal{A}$ of $\mathcal{A}_n^{(m)}$,

• the number of iterations $\mathrm{Moore}(\mathcal{A})$ of the main loop in Moore's algorithm is at most equal to $m + 1$ and always less than or equal to $n - 1$.

• the worst-case time complexity $W(\mathcal{A})$ of Moore's algorithm is in $\Theta((m+1)n)$, where the $\Theta$ is uniform for $m \in \{0, \dots, n-2\}$, or equivalently there exist two positive real numbers $C_1$ and $C_2$ independent of $n$ and $m$ such that $C_1 (m+1) n \leq W(\mathcal{A}) \leq C_2 (m+1) n$.

Proof. The result holds since the loop is iterated exactly $m + 1$ times when the set $F$ of final states is neither empty nor equal to $\{1, \dots, n\}$. Moreover from property (4) of Proposition 2.1 the integer $m$ is less than or equal to $n - 2$. If $F$ is empty or equal to $\{1, \dots, n\}$, then necessarily $m = 0$, and the time complexity of the determination of the size of $F$ is $\Theta(n)$.

The initialization and the construction of the quotient are both done in $O(n)$. The complexity of each iteration of the main loop is in $\Theta(n)$: this can be achieved classically using a lexicographic sort algorithm. Moreover in this case the constants $C_1$ and $C_2$ do not depend on $m$, proving the uniformity of both the upper and lower bounds. ■

Note that Lemma 2.3 gives a proof that the worst-case complexity of Moore's algorithm is in $O(n^2)$, as there are no more than $n - 1$ iterations in the process of the algorithm.

2.4. Probabilistic model 

The choice of the distribution is crucial for average case analysis of algorithms. Here we are considering an algorithm that builds the minimal automaton of the language recognized by a given accessible deterministic and complete one. We focus our study on the average complexity of this algorithm for the uniform distribution over accessible deterministic and complete automata with $n$ states, as $n$ tends toward infinity. Note that for the uniform distribution over automata with $n$ states, the probability for a given set to be the set of final states is equal to $1/2^n$. Therefore the case where all states are final (or all non-final) is exponentially unlikely. Some extensions of the main result to other distributions are given in Section 5.

The general framework of the average case analysis of algorithms [8] is based on the enumeration properties of studied objects, most often given by generating functions. For accessible and deterministic automata, this first step is already not easy. Although the asymptotic of the number of such automata is known, it cannot be easily handled: a result from Korshunov [13], rewritten in terms of Stirling numbers of the second kind in [2] and generalized to possibly incomplete automata in [4], is that the number of accessible and deterministic automata with $n$ states is asymptotically equal to $\alpha \beta^n n^{(|A|-1)n}$ where $\alpha$ and $\beta$ are constants depending on the cardinality $|A|$ of the alphabet, and $\alpha$ depends on whether we are considering complete automata or possibly incomplete automata.

Here some good properties of Myhill-Nerode equivalence allow us to work independently and uniformly on each transition structure. In this way the enumeration problem mentioned above can be avoided. Nevertheless, it would be necessary to enumerate some subsets of this set of automata in order to obtain a more precise result. We refer the reader to the discussion of Section 6 for more details.

3. Main result 

This section is devoted to the statement and the proof of the main theorem. 

Theorem 3.1. For any fixed integer $k \geq 1$ and for the uniform distribution over the accessible deterministic and complete automata of size $n$ over a $k$-letter alphabet, the average complexity of Moore's state minimization algorithm is $O(n \log n)$.

Note that this bound is independent of $k$, the size of the alphabet considered. Moreover, as we shall see in Section 4, it is tight in the case of a unary alphabet.

Before proving Theorem 3.1 we introduce some definitions and preliminary results. Let $\mathcal{T}$ be a fixed transition structure with $n$ states and $\ell$ be an integer such that $1 \leq \ell < n$. Let $p, q, p', q'$ be four states of $\mathcal{T}$ such that $p \neq q$ and $p' \neq q'$. We define $\mathcal{F}_\ell(p, q, p', q')$ as the set of sets of final states $F$ for which in the automaton $(\mathcal{T}, F)$ the states $p$ and $q$ are $(\ell - 1)$-equivalent, but not $\ell$-equivalent, because of a word of length $\ell$ mapping $p$ to $p'$ and $q$ to $q'$ where $p'$ and $q'$ are not 0-equivalent. In other words $\mathcal{F}_\ell(p, q, p', q')$ is the following set:

$$\mathcal{F}_\ell(p, q, p', q') = \{ F \subseteq \{1, \dots, n\} \mid \text{for the automaton } (\mathcal{T}, F),\ p \sim_{\ell-1} q,\ [\![p' \in F]\!] \neq [\![q' \in F]\!],\ \exists u \in A^\ell,\ p \cdot u = p' \text{ and } q \cdot u = q' \}.$$

Note that when $\ell$ grows, the definition of $\mathcal{F}_\ell$ is more constrained and consequently fewer non-empty sets $\mathcal{F}_\ell$ exist.

From the previous set $\mathcal{F}_\ell(p, q, p', q')$ one can define the undirected graph $\mathcal{G}_\ell(p, q, p', q')$, called the dependency graph, as follows:

• its set of vertices is $\{1, \dots, n\}$, the set of states of $\mathcal{T}$;

• there is an edge $(s, t)$ between two vertices $s$ and $t$ if and only if for all $F \in \mathcal{F}_\ell(p, q, p', q')$, $[\![s \in F]\!] = [\![t \in F]\!]$.

The dependency graph contains some information that is a basic ingredient of the proof: it is a convenient representation of necessary conditions for a set of final states to be in $\mathcal{F}_\ell(p, q, p', q')$, that is, for Moore's algorithm to require more than $\ell$ iterations because of $p$, $q$, $p'$ and $q'$. These necessary conditions will be used to give an upper bound on the cardinality of $\mathcal{F}_\ell(p, q, p', q')$ in Lemma 3.3.

Lemma 3.2. For any integer $\ell \in \{1, \dots, n-1\}$ and any states $p, q, p', q' \in \{1, \dots, n\}$ with $p \neq q$, $p' \neq q'$ such that $\mathcal{F}_\ell(p, q, p', q')$ is not empty, there exists an acyclic subgraph of $\mathcal{G}_\ell(p, q, p', q')$ with $\ell$ edges.
Proof. If $\mathcal{F}_\ell(p, q, p', q')$ is not empty, let $u = u_1 \cdots u_\ell$ with $u_i \in A$ be the smallest (for the lexicographic order) word of length $\ell$ such that $p \cdot u = p'$ and $q \cdot u = q'$. Note that every word $u$ of length $\ell$ such that $p \cdot u = p'$ and $q \cdot u = q'$ can be used. But a non-ambiguous choice of this word $u$ guarantees a complete description of the construction.

For every $i \in \{0, \dots, \ell-1\}$, let $\mathcal{G}_{\ell,i}$ be the subgraph of $\mathcal{G}_\ell(p, q, p', q')$ whose edges are defined as follows. An edge $(s, t)$ is in $\mathcal{G}_{\ell,i}$ if and only if there exists a prefix $v$ of $u$ of length less than or equal to $i$ such that $s = p \cdot v$ and $t = q \cdot v$. In other words the edges of $\mathcal{G}_{\ell,i}$ are exactly the edges $(p \cdot v, q \cdot v)$ between the states $p \cdot v$ and $q \cdot v$ where $v$ ranges over the prefixes of $u$ of length less than or equal to $i$. Such edges belong to $\mathcal{G}_\ell(p, q, p', q')$ since $p \sim_{\ell-1} q$. Moreover, the graphs $(\mathcal{G}_{\ell,i})_{0 \leq i \leq \ell-1}$ have the following properties:

(1) For each $i \in \{0, \dots, \ell-2\}$, $\mathcal{G}_{\ell,i}$ is a strict subgraph of $\mathcal{G}_{\ell,i+1}$. The graph $\mathcal{G}_{\ell,i+1}$ is obtained from $\mathcal{G}_{\ell,i}$ by adding an edge from $p \cdot w$ to $q \cdot w$, where $w$ is the prefix of $u$ of length $i + 1$. This edge does not belong to $\mathcal{G}_{\ell,i}$, for otherwise there would exist a strict prefix $z$ of $w$ such that either $p \cdot z = p \cdot w$ and $q \cdot z = q \cdot w$ or $p \cdot z = q \cdot w$ and $q \cdot z = p \cdot w$. In this case, let $w'$ be the word such that $u = ww'$, then either $p \cdot zw' = p'$ and $q \cdot zw' = q'$ or $p \cdot zw' = q'$ and $q \cdot zw' = p'$. Therefore there would exist a word of length less than $\ell$, $zw'$, such that, for $F \in \mathcal{F}_\ell(p, q, p', q')$, $[\![p \cdot zw' \in F]\!] \neq [\![q \cdot zw' \in F]\!]$, which is not possible since $p \sim_{\ell-1} q$ and $\mathcal{F}_\ell(p, q, p', q')$ is not empty. Hence this edge is a new one.

(2) For each $i \in \{0, \dots, \ell-1\}$, $\mathcal{G}_{\ell,i}$ contains $i + 1$ edges. It is a consequence of property (1), since $\mathcal{G}_{\ell,0}$ has only one edge between $p$ and $q$.

(3) For each $i \in \{0, \dots, \ell-1\}$, $\mathcal{G}_{\ell,i}$ contains no loop. Indeed $p \cdot v \neq q \cdot v$ for any prefix $v$ of $u$ since $p \not\sim_\ell q$ for any automaton $(\mathcal{T}, F)$ with $F \in \mathcal{F}_\ell(p, q, p', q')$, which is not empty.

(4) For each $i \in \{0, \dots, \ell-1\}$, if there exists a path in $\mathcal{G}_{\ell,i}$ from $s$ to $t$, then $s \sim_{\ell-1-i} t$ in every automaton $(\mathcal{T}, F)$ with $F \in \mathcal{F}_\ell(p, q, p', q')$. This property can be proved by induction.

We claim that every $\mathcal{G}_{\ell,i}$ is acyclic. Assume that it is not true, and let $j \geq 1$ be the smallest integer such that $\mathcal{G}_{\ell,j}$ contains a cycle. By property (1), $\mathcal{G}_{\ell,j}$ is obtained from $\mathcal{G}_{\ell,j-1}$ by adding an edge between $p \cdot w$ and $q \cdot w$ where $w$ is the prefix of length $j$ of $u$. As $\mathcal{G}_{\ell,j-1}$ is acyclic, this edge forms a cycle in $\mathcal{G}_{\ell,j}$. Hence in $\mathcal{G}_{\ell,j-1}$ there already exists a path between $p \cdot w$ and $q \cdot w$. Therefore by property (4) $p \cdot w \sim_{\ell-j} q \cdot w$ in any automaton $(\mathcal{T}, F)$ with $F \in \mathcal{F}_\ell(p, q, p', q')$. Let $w'$ be the word such that $u = ww'$. The length of $w'$ is $\ell - j$, hence $p \cdot u$ and $q \cdot u$ are both in $F$ or both not in $F$, which is not possible since $F \in \mathcal{F}_\ell(p, q, p', q')$.

Thus $\mathcal{G}_{\ell,\ell-1}$ is an acyclic subgraph of $\mathcal{G}_\ell(p, q, p', q')$ with $\ell$ edges according to property (2), which concludes the proof. ■

Figure 3: Illustration of the proof of Lemma 3.2 for $n = 9$, $\ell = 5$, $p = 3$, $q = 7$, $p' = 3$ and $q' = 8$ on a given transition structure. (a) $u = abbaa$ is the smallest word of length 5, for the lexicographic order, such that $3 \cdot u = 3$ and $7 \cdot u = 8$. The set $\mathcal{F}_5(3, 7, 3, 8)$ is not empty, as it contains $\{4, 8\}$. The bold transitions are the ones followed when reading $u$ from $p$ and from $q$. (b) The construction of an acyclic subgraph of $\mathcal{G}_5(3, 7, 3, 8)$ with 5 edges. To each strict prefix $v$ of $u = abbaa$ is associated an edge between $3 \cdot v$ and $7 \cdot v$. It encodes some necessary conditions for a set of final states $F$ to be in $\mathcal{F}_5(3, 7, 3, 8)$, as two states in the same connected component must be either both final or both not final.

Lemma 3.3. Given a transition structure $\mathcal{T}$ of size $n \geq 1$ and an integer $\ell$ with $1 \leq \ell < n$, for all states $p, q, p', q'$ of $\mathcal{T}$ with $p \neq q$ and $p' \neq q'$ the following result holds:

$$|\mathcal{F}_\ell(p, q, p', q')| \leq 2^{n-\ell}.$$

Proof. If $\mathcal{F}_\ell(p, q, p', q')$ is empty, the result holds. Otherwise, from Lemma 3.2 there exists an acyclic subgraph $G$ of $\mathcal{G}_\ell(p, q, p', q')$ with $\ell$ edges. Let $m$ be the number of connected components of $G$ that are not reduced to a single vertex. The states in such a component are either all final or all non-final. Therefore there are at most $m$ binary choices to make to determine whether the states in those components are final or non-final. As the graph $G$ is acyclic, there are exactly $m + \ell$ vertices that are not isolated in $G$ (a forest with $\ell$ edges and $m$ non-trivial components has $m + \ell$ vertices). Hence there are at most $2^m 2^{n-(m+\ell)} = 2^{n-\ell}$ elements in $\mathcal{F}_\ell(p, q, p', q')$: $2^m$ corresponds to the possible choices for the connected components and $2^{n-(m+\ell)}$ to the choices for the isolated vertices. ■
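
The bound of Lemma 3.3 can be checked by brute force on a small transition structure. The sketch below is only an illustration under assumed conventions (0-indexed states, `delta[p][a]` for $p \cdot a$, hypothetical function names); it enumerates all $2^n$ sets of final states, so it is usable for very small $n$ only.

```python
def pairs_by_length(delta, p, q, max_len):
    """layers[j] = set of pairs (p.v, q.v) over all words v of length j."""
    k = len(delta[0])
    layers = [{(p, q)}]
    for _ in range(max_len):
        layers.append({(delta[s][a], delta[t][a])
                       for (s, t) in layers[-1] for a in range(k)})
    return layers


def size_of_F(delta, ell, p, q, p2, q2):
    """Cardinality of F_ell(p, q, p2, q2), by enumeration of all sets F."""
    n = len(delta)
    layers = pairs_by_length(delta, p, q, ell)
    if (p2, q2) not in layers[ell]:
        return 0  # no word u of length ell with p.u = p2 and q.u = q2
    count = 0
    for mask in range(1 << n):
        final = [(mask >> s) & 1 for s in range(n)]
        if final[p2] == final[q2]:
            continue  # p2 and q2 must not be 0-equivalent
        # p ~_{ell-1} q: finality agrees along every word of length < ell.
        if all(final[s] == final[t]
               for layer in layers[:ell] for (s, t) in layer):
            count += 1
    return count


# A hypothetical 5-state, 2-letter transition structure used as a test bed.
SAMPLE = [[1, 2], [3, 0], [4, 1], [2, 4], [0, 3]]
```

For instance, with `SAMPLE`, $\ell = 1$, $p = 0$, $q = 1$, $p' = 1$ and $q' = 3$, the constraints are that states 0 and 1 agree and that states 1 and 3 differ, leaving three free binary choices.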

Proposition 3.4. Let $k \geq 1$. There exists a positive real constant $C$ such that for any positive integer $n$ and any deterministic and complete transition structure $\mathcal{T}$ of size $n$ over a $k$-letter alphabet, for the uniform distribution over the sets $F$ of final states, the average number of iterations of the main loop of Moore's algorithm applied to $(\mathcal{T}, F)$ is upper bounded by $C \log n$.

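Proof. Let $\mathcal{T}$ be a deterministic and complete transition structure of size $n$ over a $k$-letter alphabet. Denote by $\mathcal{F}^{>\ell}$ the set of sets $F$ of final states such that the execution of Moore's algorithm on $(\mathcal{T}, F)$ requires more than $\ell$ iterations or equivalently such that $(\mathcal{T}, F) \in \mathcal{A}_n^{(m)}$ with $m \geq \ell$ (see Section 2.3 for notation).

A necessary condition for $F$ to be in $\mathcal{F}^{>\ell}$ is that there exist two states $p$ and $q$ with $p \neq q$ and such that $p \sim_{\ell-1} q$ and $p \not\sim_\ell q$. Therefore there exists a word $u$ of length $\ell$ such that $[\![p \cdot u \in F]\!] \neq [\![q \cdot u \in F]\!]$. Hence $F \in \mathcal{F}_\ell(p, q, p \cdot u, q \cdot u)$ and

$$\mathcal{F}^{>\ell} = \bigcup_{\substack{p, q, p', q' \in \{1, \dots, n\} \\ p \neq q,\ p' \neq q'}} \mathcal{F}_\ell(p, q, p', q').$$

In this union the sets $\mathcal{F}_\ell(p, q, p', q')$ are not disjoint, but this characterization of $\mathcal{F}^{>\ell}$ is precise enough to obtain a useful upper bound of the cardinality of $\mathcal{F}^{>\ell}$. From the description of $\mathcal{F}^{>\ell}$ we get

$$|\mathcal{F}^{>\ell}| \leq \sum_{\substack{p, q, p', q' \in \{1, \dots, n\} \\ p \neq q,\ p' \neq q'}} |\mathcal{F}_\ell(p, q, p', q')|,$$

and using Lemma 3.3 and estimating the number of choices of the four points $p, q, p', q'$, we have

$$|\mathcal{F}^{>\ell}| \leq n(n-1)\, n(n-1)\, 2^{n-\ell} \leq n^4 2^{n-\ell}. \qquad (3.1)$$

For a fixed integer $\ell$ and for the uniform distribution over the sets $F$ of final states, the average number of iterations of the main loop of Moore's algorithm is

$$\frac{1}{2^n} \sum_{F \subseteq \{1, \dots, n\}} \mathrm{Moore}(\mathcal{T}, F) = \frac{1}{2^n} \sum_{F \in \mathcal{F}^{\leq \ell}} \mathrm{Moore}(\mathcal{T}, F) + \frac{1}{2^n} \sum_{F \in \mathcal{F}^{>\ell}} \mathrm{Moore}(\mathcal{T}, F),$$

where $\mathcal{F}^{\leq \ell}$ is the complement of $\mathcal{F}^{>\ell}$ in the set of all subsets of states. Moreover by Lemma 2.3, for any $F \in \mathcal{F}^{\leq \ell}$, $\mathrm{Moore}(\mathcal{T}, F) \leq \ell$. Therefore, since $|\mathcal{F}^{\leq \ell}| \leq 2^n$,

$$\frac{1}{2^n} \sum_{F \in \mathcal{F}^{\leq \ell}} \mathrm{Moore}(\mathcal{T}, F) \leq \ell.$$

Using Lemma 2.3 again to give an upper bound for $\mathrm{Moore}(\mathcal{T}, F)$ when $F \in \mathcal{F}^{>\ell}$ and the estimate of $|\mathcal{F}^{>\ell}|$ given by Equation (3.1) we have

$$\frac{1}{2^n} \sum_{F \in \mathcal{F}^{>\ell}} \mathrm{Moore}(\mathcal{T}, F) \leq n^5 2^{-\ell}.$$

Finally, choosing $\ell = \lceil 5 \log_2 n \rceil$, we obtain that there exists a positive real $C$ such that

$$\frac{1}{2^n} \sum_{F \subseteq \{1, \dots, n\}} \mathrm{Moore}(\mathcal{T}, F) \leq \lceil 5 \log_2 n \rceil + n^5 2^{-\lceil 5 \log_2 n \rceil} \leq C \log n,$$

concluding the proof. ■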
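
The final step can be spelled out as a short worked computation. It is a sketch that makes the choice $\ell = \lceil 5 \log_2 n \rceil$ explicit; the numerical constants shown are one admissible choice and are not claimed to be the ones intended by the statement.

```latex
\begin{align*}
2^{-\lceil 5\log_2 n\rceil} &\le 2^{-5\log_2 n} = n^{-5},\\
\frac{1}{2^n}\sum_{F\subseteq\{1,\dots,n\}} \mathrm{Moore}(\mathcal{T},F)
  &\le \lceil 5\log_2 n\rceil + n^{5}\,2^{-\lceil 5\log_2 n\rceil}
   \le \lceil 5\log_2 n\rceil + 1
   \le 5\log_2 n + 2 .
\end{align*}
```

For instance, for $n = 1024$ this gives $\ell = 50$ and an average of at most 51 iterations, to be compared with the worst-case value $n - 1 = 1023$.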

Now we prove Theorem 3.1.

Proof of the main theorem: Let $\mathcal{T}_n$ denote the set of deterministic, accessible and complete transition structures with $n$ states. For a transition structure $\mathcal{T} \in \mathcal{T}_n$, there are exactly $2^n$ distinct automata $(\mathcal{T}, F)$.

Recall that the set $\mathcal{A}_n$ of deterministic, accessible and complete automata with $n$ states is in bijection with the pairs $(\mathcal{T}, F)$ consisting of a deterministic, accessible and complete transition structure $\mathcal{T} \in \mathcal{T}_n$ with $n$ states and a subset $F \subseteq \{1, \dots, n\}$ of final states. Therefore, for the uniform distribution over the set $\mathcal{A}_n$, the average number of iterations of the main loop when Moore's algorithm is applied to an element of $\mathcal{A}_n$ is

$$\frac{1}{|\mathcal{A}_n|} \sum_{\mathcal{A} \in \mathcal{A}_n} \mathrm{Moore}(\mathcal{A}) = \frac{1}{2^n |\mathcal{T}_n|} \sum_{\mathcal{T} \in \mathcal{T}_n} \sum_{F \subseteq \{1, \dots, n\}} \mathrm{Moore}(\mathcal{T}, F).$$

Using Proposition 3.4 we get

$$\frac{1}{|\mathcal{A}_n|} \sum_{\mathcal{A} \in \mathcal{A}_n} \mathrm{Moore}(\mathcal{A}) \leq \frac{1}{|\mathcal{T}_n|} \sum_{\mathcal{T} \in \mathcal{T}_n} C \log n \leq C \log n.$$

Hence the average number of iterations is bounded by $C \log n$, and by Lemma 2.3 the average complexity of Moore's algorithm is upper bounded by $C_2 C n \log n$, concluding the proof. □






F. BASSINO, J. DAVID, AND C. NICAUD 



4. Tight bound for unary automata 

In this section we prove that the bound $O(n \log n)$ is optimal for the uniform distribution on unary automata with $n$ states, that is, automata on a one-letter alphabet.

We shall use the following result on words, whose proof is given in detail in [8, p. 285]. For a word $u$ on the binary alphabet $\{0, 1\}$, the longest run of 1 is the length of the longest consecutive block of 1's in $u$.

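Proposition 4.1. For any real number $h$ and for the uniform distribution on binary words of length $n$, the probability that the longest run of 1 is smaller than $\lfloor \log_2 n + h \rfloor$ is equal to

$$e^{-\alpha(n) 2^{-h-1}} + O\left(\frac{\log n}{\sqrt{n}}\right),$$

where the $O$ is uniform on $h$, and $\alpha(n) = 2^{\log_2 n - \lfloor \log_2 n \rfloor}$.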

Corollary 4.2. For the uniform distribution on binary words of length $n$, the probability that the longest run of 1 is smaller than $\lfloor \frac{1}{2} \log_2 n \rfloor$ is smaller than $e^{-\sqrt{n}/2}$.

Proof. Set $h = -\frac{1}{2} \log_2 n$ in Proposition 4.1 and use that for any integer $n$, $\alpha(n) \geq 1$. ■
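
For small values of $n$ the probability in Corollary 4.2 can be computed exactly and compared with $e^{-\sqrt{n}/2}$. The sketch below does this with a simple dynamic program over the length of the trailing run of 1's; the function names are hypothetical, and the comparison is only a numerical illustration on a few values of $n$, not a substitute for the asymptotic argument.

```python
import math


def count_runs_below(n, limit):
    """Number of binary words of length n whose longest run of 1 is < limit."""
    if limit <= 0:
        return 0
    # ways[r] = number of admissible words built so far ending with r ones.
    ways = [1] + [0] * (limit - 1)
    for _ in range(n):
        total = sum(ways)
        # Appending 0 resets the trailing run; appending 1 extends it by one,
        # and words whose run would reach `limit` are dropped.
        ways = [total] + ways[:limit - 1]
    return sum(ways)


def probability_short_runs(n):
    """Probability that the longest run of 1 is < floor(log2(n) / 2)."""
    limit = math.floor(0.5 * math.log2(n))
    return count_runs_below(n, limit) / 2 ** n


def corollary_bound(n):
    """The bound exp(-sqrt(n) / 2) stated in Corollary 4.2."""
    return math.exp(-math.sqrt(n) / 2)
```

For $n = 16$ the threshold is 2, and the admissible words are those without two consecutive 1's, which are counted by a Fibonacci number.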

The shape of an accessible deterministic and complete automaton with $n$ states on a one-letter alphabet $A = \{a\}$ is very specific. If we label the states using the depth-first order, then for all $q \in \{1, \dots, n-1\}$, $q \cdot a = q + 1$. The state $n \cdot a$ entirely determines the transition structure of the automaton. Hence there are $n 2^n$ distinct unary automata with $n$ states. We shall also use the following result from [16]:

Proposition 4.3. For the uniform distribution on unary automata with $n$ states, the probability that an automaton is minimal is asymptotically equal to $\frac{1}{2}$.

We can now prove the optimality of the $O(n \log n)$ bound for unary automata:

Theorem 4.4. For the uniform distribution on unary automata with $n$ states, the average time complexity of Moore's state minimization algorithm is $\Theta(n \log n)$.

Proof. From Theorem 3.1 this time complexity is $O(n \log n)$. It remains to study the lower bound of the average time complexity of Moore's algorithm.

For any binary word $u$ of size $n$, we denote by $F(u)$ the subset of $\{1, \dots, n\}$ such that $i \in F(u)$ if and only if the $i$-th letter of $u$ is 1. The map $F$ is clearly a bijection between the binary words of length $n$ and the subsets of $\{1, \dots, n\}$. Therefore a unary automaton with $n$ states is completely defined by a word $u$ of length $n$, encoding the set of final states, and an integer $m \in \{1, \dots, n\}$ corresponding to $n \cdot a$; we denote such an automaton by the pair $(u, m) \in \{0, 1\}^n \times \{1, \dots, n\}$. Let $\ell$ be the integer defined by $\ell = \lfloor \frac{1}{2} \log_2 n \rfloor$. Let $M_n$ be the set of minimal unary automata with $n$ states, and $S_n$ be the subset of $M_n$ defined by

$$S_n = \{(u, m) \in M_n \mid \text{the longest run of 1 in } u \text{ is smaller than } \ell\}.$$

As the number of elements in $S_n$ is smaller than the number of automata $(u, m)$ whose longest run of 1 in $u$ is smaller than $\ell$, from Corollary 4.2 we have $|S_n| = o(n 2^n)$. Let $(u, m)$ be a minimal automaton in $M_n \setminus S_n$. The word $u$ has a longest run of 1 greater than or equal to $\ell$. Let $p \in \{1, \dots, n\}$ be the index of the beginning of such a longest run in $u$. The states $p$ and $p + 1$ require $\ell$ iterations in Moore's algorithm to be separated, as $p \cdot a^i$ and $(p + 1) \cdot a^i$ are both final for every $i \in \{0, \dots, \ell - 2\}$. They must be separated by the algorithm at some point since $(u, m)$ is minimal. Hence $\mathrm{Moore}((u, m)) \geq \ell$ for any $(u, m) \in M_n \setminus S_n$. Therefore

$$\frac{1}{n 2^n} \sum_{(u, m) \in \{0, 1\}^n \times \{1, \dots, n\}} \mathrm{Moore}((u, m)) \geq \frac{1}{n 2^n} \sum_{(u, m) \in M_n \setminus S_n} \mathrm{Moore}((u, m)) \geq \frac{1}{n 2^n} |M_n \setminus S_n|\, \ell \geq \frac{1}{n 2^n} |M_n|\, \ell - \frac{1}{n 2^n} |S_n|\, \ell \geq \frac{1}{2} \ell - o(\ell).$$

The last inequality is obtained using Proposition 4.3, concluding the proof since by hypothesis $\ell = \lfloor \frac{1}{2} \log_2 n \rfloor$. ■
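
The key step of this proof, namely that a minimal unary automaton whose word $u$ contains a run of 1's of length $r$ forces at least $r$ iterations, can be checked exhaustively for small $n$. The sketch below encodes a unary automaton as the pair $(u, m)$ used above, with $u$ given as a Python string; the function names are hypothetical and the partition refinement is a plain restatement of Moore's main loop specialized to one letter.

```python
def moore_iterations_unary(u, m):
    """Run Moore's main loop on the unary automaton (u, m).

    States are 1..n with q.a = q + 1 for q < n and n.a = m; state i is final
    iff u[i - 1] == "1". Returns (iterations, number of classes).
    """
    n = len(u)
    succ = [q + 1 for q in range(n - 1)] + [m - 1]  # 0-indexed successors
    part = [int(c) for c in u]                      # partition of ~_0
    if len(set(part)) == 1:
        return 0, 1  # F empty or F = Q: handled before the main loop
    iterations = 0
    while True:
        iterations += 1
        sig = [(part[q], part[succ[q]]) for q in range(n)]
        ids = {s: i for i, s in enumerate(sorted(set(sig)))}
        if len(ids) == len(set(part)):
            return iterations, len(ids)
        part = [ids[s] for s in sig]


def longest_run(u):
    """Length of the longest block of consecutive 1's in the string u."""
    return max(len(block) for block in u.split("0"))
```

For example, the automaton `("0111", 1)` is minimal, its longest run has length 3, and the main loop is executed three times. Since every unary automaton of this shape is accessible, minimality amounts to the final partition having $n$ classes.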



5. Extensions 

In this section we briefly present two extensions of Proposition 3.4 and Theorem 3.1.

5.1. Bernoulli distributions for the sets of final states 

Let $p$ be a fixed real number with $0 < p < 1$. Let $\mathcal{T}$ be a transition structure with $n$ states. Consider the distribution on the sets of final states for $\mathcal{T}$ defined such that each state has a probability $p$ of being final. The probability for a given subset $F$ of $\{1, \dots, n\}$ to be the set of final states is $\mathbb{P}(F) = p^{|F|} (1 - p)^{n - |F|}$.

A statement analogous to Proposition 3.4 still holds in this case. The proof is similar although a bit more technical, as Proposition 3.4 corresponds to the special case where $p = \frac{1}{2}$. Hence, for this distribution of sets of final states, the average complexity of Moore's algorithm is also $O(n \log n)$.

5.2. Possibly incomplete automata 

Now consider the uniform distribution on possibly incomplete deterministic automata with $n$ states and assume that the first step of Moore's algorithm applied to an incomplete automaton consists in the completion of the automaton making use of a sink state. In this case Proposition 3.4 still holds. Indeed, Lemma 3.3 is still correct, even if the sets of final states $F$ are the sets that do not contain the sink state. As a consequence, if a transition structure $\mathcal{T}$ is incomplete, the average complexity of Moore's algorithm for the uniform choice of set of final states of the completed transition structure, such that the sink state is not final, is in $O((n + 1) \log(n + 1)) = O(n \log n)$.
6. Open problem

We conjecture that for the uniform distribution on complete, accessible and deterministic automata with $n$ states over a $k$-letter alphabet, with $k \geq 2$, the average time complexity of Moore's algorithm is in $\Theta(n \log \log n)$.

This conjecture comes from the following observations. First, Figure 2 seems to show a sub-logarithmic asymptotic number of iterations in Moore's algorithm. Second, if the automaton with $n$ states is minimal, at least $\Omega(\log \log n)$ iterations are required to isolate every state: $\log n$ words are needed, and this can be achieved in the best case using all the words of length less than or equal to $\log \log n$. Moreover, in [2] we conjectured that a constant part of deterministic automata are minimal; if it is true, this would suggest that $\Omega(n \log \log n)$ is a lower bound for the average complexity of Moore's algorithm. The conjecture above is that this lower bound is tight.